AITopics | Rahway

Collaborating Authors

Rahway

PepEVOLVE: Position-Aware Dynamic Peptide Optimization via Group-Relative Advantage

Nguyen, Trieu, Pang, Hao-Wei, Feng, Shasha

arXiv.org Artificial IntelligenceNov-24-2025

Macrocyclic peptides are an emerging modality that combines biologics-like affinity with small-molecule-like developability, but their vast combinatorial space and multi-parameter objectives make lead optimization slow and challenging. Prior generative approaches such as PepINVENT require chemists to pre-specify mutable positions for optimization, choices that are not always known a priori, and rely on static pretraining and optimization algorithms that limit the model's ability to generalize and effectively optimize peptide sequences. We introduce PepEVOLVE, a position-aware, dynamic framework that learns both where to edit and how to dynamically optimize peptides for multi-objective improvement. PepEVOLVE (i) augments pretraining with dynamic masking and CHUCKLES shifting to improve generalization, (ii) uses a context-free multi-armed bandit router that discovers high-reward residues, and (iii) couples a novel evolving optimization algorithm with group-relative advantage to stabilize reinforcement updates. During in silico evaluations, the router policy reliably learns and concentrates probability on chemically meaningful sites that influence the peptide's properties. On a therapeutically motivated Rev-binding macrocycle benchmark, PepEVOLVE outperformed PepINVENT by reaching higher mean scores (approximately 0.8 vs. 0.6), achieving best candidates with a score of 0.95 (vs. 0.87), and converging in fewer steps under the task of optimizing permeability and lipophilicity with structural constraints. Overall, PepEVOLVE offers a practical, reproducible path to peptide lead optimization when optimal edit sites are unknown, enabling more efficient exploration and improving design quality across multiple objectives.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2511.16912

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New Jersey > Union County > Rahway (0.04)
North America > United States > Florida (0.04)
(2 more...)

Genre:

Workflow (0.46)
Research Report (0.40)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning

Wang, Ruheng, Zhang, Hang, Nguyen, Trieu, Feng, Shasha, Pang, Hao-Wei, Yu, Xiang, Xiao, Li, Zhang, Peter Zhiping

arXiv.org Artificial IntelligenceNov-21-2025

Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike prior approaches, PepThink-R1 explicitly reasons about monomer-level modifications during sequence generation, enabling interpretable design choices while optimizing for multiple pharmacological properties. Guided by a tailored reward function balancing chemical validity and property improvements, the model autonomously explores diverse sequence variants. We demonstrate that PepThink-R1 generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming existing general LLMs (e.g., GPT-5) and domain-specific baseline in both optimization success and interpretability. To our knowledge, this is the first LLM-based peptide design framework that combines explicit reasoning with RL-driven property control, marking a step toward reliable and transparent peptide optimization for therapeutic discovery.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2508.14765

Country:

North America > United States > New Jersey > Union County > Rahway (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Multitask finetuning and acceleration of chemical pretrained models for small molecule drug property prediction

Adrian, Matthew, Chung, Yunsie, Boyd, Kevin, Paliwal, Saee, Veccham, Srimukh Prasad, Cheng, Alan C.

arXiv.org Artificial IntelligenceOct-15-2025

Chemical pretrained models, sometimes referred to as foundation models, are receiving considerable interest for drug discovery applications. The general chemical knowledge extracted from self-supervised training has the potential to improve predictions for critical drug discovery endpoints, including on-target potency and ADMET properties. Multi-task learning has previously been successfully leveraged to improve predictive models. Here, we show that enabling multitasking in finetuning of chemical pretrained graph neural network models such as Kinetic GROVER Multi-Task (KERMT), an enhanced version of the GROVER model, and Knowledge-guided Pre-training of Graph Transformer (KGPT) significantly improves performance over non-pretrained graph neural network models. Surprisingly, we find that the performance improvement from finetuning KERMT in a multitask manner is most significant at larger data sizes. Additionally, we publish two multitask ADMET data splits to enable more accurate benchmarking of multitask deep learning methods for drug property prediction. Finally, we provide an accelerated implementation of the KERMT model on GitHub, unlocking large-scale pretraining, finetuning, and inference in industrial drug discovery workflows.

artificial intelligence, deep learning, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2510.12719

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New Jersey > Union County > Rahway (0.05)
North America > United States > California > Santa Clara County > Santa Clara (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Target alignment in truncated kernel ridge regression

Neural Information Processing SystemsAug-16-2025, 18:21:53 GMT

We show that TKRR can achieve a fast rate of convergence in the over-aligned regime.

alignment, artificial intelligence, machine learning, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > New Jersey > Union County > Rahway (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)

Add feedback

Multivariate Conformal Selection

Bai, Tian, Zhao, Yue, Yu, Xiang, Yang, Archer Y.

arXiv.org Machine LearningMay-5-2025

Selecting high-quality candidates from large datasets is critical in applications such as drug discovery, precision medicine, and alignment of large language models (LLMs). While Conformal Selection (CS) provides rigorous uncertainty quantification, it is limited to univariate responses and scalar criteria. To address this issue, we propose Multivariate Conformal Selection (mCS), a generalization of CS designed for multivariate response settings. Our method introduces regional monotonicity and employs multivariate nonconformity scores to construct conformal p-values, enabling finite-sample False Discovery Rate (FDR) control. We present two variants: mCS-dist, using distance-based scores, and mCS-learn, which learns optimal scores via differentiable optimization. Experiments on simulated and real-world datasets demonstrate that mCS significantly improves selection power while maintaining FDR control, establishing it as a robust framework for multivariate selection tasks.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Machine Learning

2505.00917

Country:

North America > Canada > Quebec > Montreal (0.14)
Europe > Portugal > Braga > Braga (0.04)
North America > United States > New Jersey > Union County > Rahway (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.87)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)

Add feedback

SHACL-SKOS Based Knowledge Representation of Material Safety Data Sheet (SDS) for the Pharmaceutical Industry

Lu, Brian, Pham, Dennis, Chang, Ti-Chiun, Lovette, Michael, Bui, Terri, Ma, Stephen

arXiv.org Artificial IntelligenceFeb-11-2025

We report the development of a knowledge representation and reasoning (KRR) system built on hybrid SHACL-SKOS ontologies for globally harmonized system (GHS) material Safety Data Sheets (SDS) to enhance chemical safety communication and regulatory compliance. SDS are comprehensive documents containing safety and handling information for chemical substances. Thus, they are an essential part of workplace safety and risk management. However, the vast number of Safety Data Sheets from multiple organizations, manufacturers, and suppliers that produce and distribute chemicals makes it challenging to centralize and access SDS documents through a single repository. To accomplish the underlying issues of data exchange related to chemical shipping and handling, we construct SDS related controlled vocabulary and conditions validated by SHACL, and knowledge systems of similar domains linked via SKOS. The resulting hybrid ontologies aim to provide standardized yet adaptable representations of SDS information, facilitating better data sharing, retrieval, and integration across various platforms. This paper outlines our SHACL-SKOS system architectural design and showcases our implementation for an industrial application streamlining the generation of a composite shipping cover sheet.

artificial intelligence, information fusion, taxonomy, (13 more...)

arXiv.org Artificial Intelligence

2502.07944

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > New Jersey > Union County > Rahway (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(5 more...)

Genre: Research Report (0.40)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Government > Regional Government > North America Government > United States Government (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)

Add feedback

Doubly Robust Conformalized Survival Analysis with Right-Censored Data

Sesia, Matteo, Svetnik, Vladimir

arXiv.org Artificial IntelligenceDec-12-2024

Survival analysis is a key area of statistics that focuses on modeling time-to-event data, with applications in many fields including clinical trials, engineering, and marketing. For example, in a clinical trial, researchers may aim to predict how long a cancer patient is likely to survive based on individual characteristics and treatments received. Two central goals are modeling the probability that an event will not occur before a given time and predicting the actual event time. These tasks are typically complicated by the fact that the data are censored--the exact event time may be unknown due to study limitations or participant withdrawal. While traditional methods, such as the Kaplan-Meier estimator and parametric models like the Cox proportional hazards model (Cox, 1972), are valued for their interpretability, they tend to struggle in high-dimensional settings or when their assumptions are violated. As a result, machine learning (ML) approaches are gaining popularity (Ishwaran et al., 2008; Katzman et al., 2018), despite the difficulty of obtaining uncertainty estimates and statistical guarantees. A promising approach to integrating ML models with rigorous statistical guarantees in survival analysis was recently introduced by Candès et al. (2023) and refined by Gui et al. (2024). Their conformal inference (Vovk et al., 2005; Lei & Wasserman, 2014) framework can use any survival

artificial intelligence, dr cosarc, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2412.09729

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > New Jersey > Union County > Rahway (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Bayesian Joint Additive Factor Models for Multiview Learning

Anceschi, Niccolo, Ferrari, Federico, Dunson, David B., Mallick, Himel

arXiv.org Machine LearningJun-2-2024

It is increasingly common in a wide variety of applied settings to collect data of multiple different types on the same set of samples. Our particular focus in this article is on studying relationships between such multiview features and responses. A motivating application arises in the context of precision medicine where multi-omics data are collected to correlate with clinical outcomes. It is of interest to infer dependence within and across views while combining multimodal information to improve the prediction of outcomes. The signal-to-noise ratio can vary substantially across views, motivating more nuanced statistical tools beyond standard late and early fusion. This challenge comes with the need to preserve interpretability, select features, and obtain accurate uncertainty quantification. We propose a joint additive factor regression model (JAFAR) with a structured additive design, accounting for shared and view-specific components. We ensure identifiability via a novel dependent cumulative shrinkage process (D-CUSP) prior. We provide an efficient implementation via a partially collapsed Gibbs sampler and extend our approach to allow flexible feature and outcome distributions. Prediction of time-to-labor onset from immunome, metabolome, and proteome data illustrates performance gains against state-of-the-art competitors. Our open-source software (R package) is available at https://github.com/niccoloanceschi/jafar.

bayesian joint additive factor model, jafar, matrix, (11 more...)

arXiv.org Machine Learning

2406.00778

Country:

North America > United States > North Carolina > Durham County > Durham (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > New Jersey > Union County > Rahway (0.04)
Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.66)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)

Add feedback

QComp: A QSAR-Based Data Completion Framework for Drug Discovery

Yang, Bingjia, Chung, Yunsie, Yang, Archer Y., Yuan, Bo, Yu, Xiang

arXiv.org Artificial IntelligenceMay-19-2024

In drug discovery, in vitro and in vivo experiments reveal biochemical activities related to the efficacy and toxicity of compounds. The experimental data accumulate into massive, ever-evolving, and sparse datasets. Quantitative Structure-Activity Relationship (QSAR) models, which predict biochemical activities using only the structural information of compounds, face challenges in integrating the evolving experimental data as studies progress. We develop QSAR-Complete (QComp), a data completion framework to address this issue. Based on pre-existing QSAR models, QComp utilizes the correlation inherent in experimental data to enhance prediction accuracy across various tasks. Moreover, QComp emerges as a promising tool for guiding the optimal sequence of experiments by quantifying the reduction in statistical uncertainty for specific endpoints, thereby aiding in rational decision-making throughout the drug discovery process.

assay, dataset, qcomp, (13 more...)

arXiv.org Artificial Intelligence

2405.11703

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > Quebec > Montreal (0.14)
Asia > Macao (0.06)
(5 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Can Foundational Large Language Models Assist with Conducting Pharmaceuticals Manufacturing Investigations?

Salami, Hossein, Smith-Goettler, Brandye, Yadav, Vijay

arXiv.org Artificial IntelligenceApr-23-2024

General purpose Large Language Models (LLM) such as the Generative Pretrained Transformer (GPT) and Large Language Model Meta AI (LLaMA) have attracted much attention in recent years. There is strong evidence that these models can perform remarkably well in various natural language processing tasks. However, how to leverage them to approach domain-specific use cases and drive value remains an open question. In this work, we focus on a specific use case, pharmaceutical manufacturing investigations, and propose that leveraging historical records of manufacturing incidents and deviations in an organization can be beneficial for addressing and closing new cases, or de-risking new manufacturing campaigns. Using a small but diverse dataset of real manufacturing deviations selected from different product lines, we evaluate and quantify the power of three general purpose LLMs (GPT-3.5, GPT-4, and Claude-2) in performing tasks related to the above goal. In particular, (1) the ability of LLMs in automating the process of extracting specific information such as root cause of a case from unstructured data, as well as (2) the possibility of identifying similar or related deviations by performing semantic search on the database of historical records are examined. While our results point to the high accuracy of GPT-4 and Claude-2 in the information extraction task, we discuss cases of complex interplay between the apparent reasoning and hallucination behavior of LLMs as a risk factor. Furthermore, we show that semantic search on vector embedding of deviation descriptions can be used to identify similar records, such as those with a similar type of defect, with a high level of accuracy. We discuss further improvements to enhance the accuracy of similar record identification.

deviation, incident, language model, (15 more...)

arXiv.org Artificial Intelligence

2404.15578

Country: North America > United States > New Jersey > Union County > Rahway (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback